Author- Akash Tiwari (229028)
This notebook segments customers using bank transaction data: 20,000 rows sampled randomly from a dataset of 1 million (10 lakh) bank transactions. The data is preprocessed, converted into RFM metrics, and then scaled before being fed to three algorithms: k-means clustering, Gaussian Mixture Model, and DBSCAN.
Contents of the Notebook:
#Importing the necessary libraries
import re
from datetime import datetime, date
from time import time

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
plt.style.use("fivethirtyeight")

from sklearn import preprocessing
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans, DBSCAN
try:
    import plotly.express as px
    import plotly.graph_objects as go
except ImportError:
    !pip install plotly
    import plotly.express as px
    import plotly.graph_objects as go
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)
warnings.simplefilter(action='ignore', category=DeprecationWarning)
warnings.simplefilter(action='ignore', category=RuntimeWarning)
import scipy.cluster.hierarchy as sch
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.metrics import silhouette_score
try:
    from kneed import KneeLocator
except ImportError:
    !pip install kneed
    from kneed import KneeLocator
#------------------------------------------------------------------
try:
    from yellowbrick.cluster import KElbowVisualizer
except ImportError:
    !pip install -U yellowbrick
    from yellowbrick.cluster import KElbowVisualizer
#Importing the dataset
df=pd.read_csv('bank_transactions_red.csv')
# Printing the first five rows to get the view of the dataset
print(f'The dataset has {df.shape[0]} records and {df.shape[1]} columns')
df.head()
The dataset has 20000 records and 9 columns
| TransactionID | CustomerID | CustomerDOB | CustGender | CustLocation | CustAccountBalance | TransactionDate | TransactionTime | TransactionAmount (INR) | |
|---|---|---|---|---|---|---|---|---|---|
| 0 | T892846 | C5725279 | 8/9/84 | M | JASPUR | 84447.82 | 7/9/16 | 183210 | 11858.0 |
| 1 | T444995 | C4588538 | 2/1/94 | M | GURGAON | 12549.49 | 16/8/16 | 161300 | 250.0 |
| 2 | T614897 | C2416476 | 14/11/90 | M | NEW DELHI | 33607.65 | 26/8/16 | 102007 | 3360.0 |
| 3 | T457037 | C5940151 | 15/9/90 | M | HYDERABAD | 38238.86 | 21/8/16 | 110438 | 250.0 |
| 4 | T997442 | C5922989 | 27/11/88 | M | PURBO MEDINIPUR | 9506.85 | 14/9/16 | 90810 | 33.0 |
The dataset captures the transactions for a bank in India. It has the following columns :
# Now we see the data types and the count of non null values
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20000 entries, 0 to 19999
Data columns (total 9 columns):
 #   Column                   Non-Null Count  Dtype
---  ------                   --------------  -----
 0   TransactionID            20000 non-null  object
 1   CustomerID               20000 non-null  object
 2   CustomerDOB              19936 non-null  object
 3   CustGender               19980 non-null  object
 4   CustLocation             19999 non-null  object
 5   CustAccountBalance       19953 non-null  float64
 6   TransactionDate          20000 non-null  object
 7   TransactionTime          20000 non-null  int64
 8   TransactionAmount (INR)  20000 non-null  float64
dtypes: float64(2), int64(1), object(6)
memory usage: 1.4+ MB
## The function defined below records the data type, number of unique values and the number of null values for each column in the dataframe
def check(df):
l=[]
columns=df.columns
for col in columns:
dtypes=df[col].dtypes
nunique=df[col].nunique()
sum_null=df[col].isnull().sum()
l.append([col,dtypes,nunique,sum_null])
df_check=pd.DataFrame(l)
df_check.columns=['column','dtypes','nunique','sum_null']
return df_check
check(df)
| column | dtypes | nunique | sum_null | |
|---|---|---|---|---|
| 0 | TransactionID | object | 20000 | 0 |
| 1 | CustomerID | object | 19932 | 0 |
| 2 | CustomerDOB | object | 7742 | 64 |
| 3 | CustGender | object | 2 | 20 |
| 4 | CustLocation | object | 1866 | 1 |
| 5 | CustAccountBalance | float64 | 17236 | 47 |
| 6 | TransactionDate | object | 54 | 0 |
| 7 | TransactionTime | int64 | 17036 | 0 |
| 8 | TransactionAmount (INR) | float64 | 5492 | 0 |
From the above dataframe we can see that we have 20,000 unique transactions, with 64 null values in CustomerDOB, 20 in CustGender, one in CustLocation and 47 in CustAccountBalance. As the number of null values is small compared with the total number of records, we can drop the null records from the dataframe.
Also, CustomerDOB and TransactionDate are stored as strings, so for further analysis and calculations these columns need to be converted to the datetime datatype.
#Dropping the null values in the original dataframe
df.dropna(inplace=True)
df.isnull().sum()
TransactionID              0
CustomerID                 0
CustomerDOB                0
CustGender                 0
CustLocation               0
CustAccountBalance         0
TransactionDate            0
TransactionTime            0
TransactionAmount (INR)    0
dtype: int64
Now that the dataframe has no null values, we can proceed with further preprocessing.
#Converting the datatype of TransactionDate and CustomerDOB from string to datetime
# The dates are recorded day-first (e.g. '14/11/90'); dayfirst=True prevents
# ambiguous strings such as '8/9/84' from being parsed month-first
df['TransactionDate'] = pd.to_datetime(df['TransactionDate'], dayfirst=True)
df['CustomerDOB'] = pd.to_datetime(df['CustomerDOB'], dayfirst=True)
df['CustomerDOB'].value_counts()
1800-01-01 1059
1990-01-07 19
1991-01-01 18
1990-10-08 16
1989-01-01 15
...
1995-01-18 1
2068-10-30 1
1983-05-29 1
1997-01-29 1
2071-12-19 1
Name: CustomerDOB, Length: 7726, dtype: int64
We can observe that the DOB 1800-01-01 has a frequency of 1,059, but calculating the age of such customers produces an anomaly: a customer born in 1800 would not be alive in the current era, nor relevant to a behavioural study of the current generation. This is why we drop the records having 1800-01-01 as the DOB.
We can also observe DOBs with a year beyond 2023, which again is an anomaly. One option is to drop such records, but since this is likely a recording error (a two-digit year parsed into the wrong century), instead of dropping them we reduce the year by 100 for records with a birth year of 2023 or later.
df = df.drop(df[df['CustomerDOB'] == '1800-01-01'].index,axis = 0)
df.loc[df.CustomerDOB.dt.year >= 2023, 'CustomerDOB'] = df.loc[df.CustomerDOB.dt.year >= 2023, 'CustomerDOB'] - pd.DateOffset(years = 100)
df.head()
| TransactionID | CustomerID | CustomerDOB | CustGender | CustLocation | CustAccountBalance | TransactionDate | TransactionTime | TransactionAmount (INR) | |
|---|---|---|---|---|---|---|---|---|---|
| 0 | T892846 | C5725279 | 1984-08-09 | M | JASPUR | 84447.82 | 2016-07-09 | 183210 | 11858.0 |
| 1 | T444995 | C4588538 | 1994-02-01 | M | GURGAON | 12549.49 | 2016-08-16 | 161300 | 250.0 |
| 2 | T614897 | C2416476 | 1990-11-14 | M | NEW DELHI | 33607.65 | 2016-08-26 | 102007 | 3360.0 |
| 3 | T457037 | C5940151 | 1990-09-15 | M | HYDERABAD | 38238.86 | 2016-08-21 | 110438 | 250.0 |
| 4 | T997442 | C5922989 | 1988-11-27 | M | PURBO MEDINIPUR | 9506.85 | 2016-09-14 | 90810 | 33.0 |
df['CustomerDOB'].value_counts()
1990-01-07 19
1991-01-01 18
1990-10-08 16
1989-01-01 15
1994-01-01 14
..
1995-01-18 1
1968-10-30 1
1983-05-29 1
1997-01-29 1
1971-12-19 1
Name: CustomerDOB, Length: 7725, dtype: int64
All the anomalies in the CustomerDOB column have now been treated. We can calculate the age from this column and add it to the dataframe as a new column.
# Age in years as of 'today' (note: the transactions date from 2016, so this is
# the customer's current age, not their age at transaction time)
df['Customer_age'] = ((pd.to_datetime('today') - df['CustomerDOB'])/np.timedelta64(1, 'Y')).round(0)
df.head()
| TransactionID | CustomerID | CustomerDOB | CustGender | CustLocation | CustAccountBalance | TransactionDate | TransactionTime | TransactionAmount (INR) | Customer_age | |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | T892846 | C5725279 | 1984-08-09 | M | JASPUR | 84447.82 | 2016-07-09 | 183210 | 11858.0 | 39.0 |
| 1 | T444995 | C4588538 | 1994-02-01 | M | GURGAON | 12549.49 | 2016-08-16 | 161300 | 250.0 | 29.0 |
| 2 | T614897 | C2416476 | 1990-11-14 | M | NEW DELHI | 33607.65 | 2016-08-26 | 102007 | 3360.0 | 32.0 |
| 3 | T457037 | C5940151 | 1990-09-15 | M | HYDERABAD | 38238.86 | 2016-08-21 | 110438 | 250.0 | 32.0 |
| 4 | T997442 | C5922989 | 1988-11-27 | M | PURBO MEDINIPUR | 9506.85 | 2016-09-14 | 90810 | 33.0 | 34.0 |
The Customer_age column has been added to the dataframe
## Checking the duplicate values
df.duplicated().sum()
0
The dataset has no duplicate values to handle
#Storing the numerical and categorical columns in two different variables
num_col = df.select_dtypes(include=np.number)
cat_col = df.select_dtypes(exclude=np.number)
plt.style.use("fivethirtyeight")
plt.figure(figsize=(30,30))
for index,column in enumerate(num_col):
plt.subplot(7,4,index+1)
sns.boxplot(data=num_col,x=column)
plt.tight_layout(pad = 1.0)
The boxplots above show that CustAccountBalance, TransactionAmount and Customer_age have many outliers. We could remove these outliers or substitute them with values fitted to the distribution of the observations, but since this is bank transaction data collected over time, I have decided not to manipulate them. These columns can legitimately vary widely with a customer's purchasing power and financial background, and such boundary values can themselves help in segmenting a common behavioural pattern.
RFM became popular in direct-mail marketing in the 1960s. The technique is still being used, applied to digital services, and used to influence hundreds (and even millions) of eCommerce clients.
The benefits of RFM:
Here, each of the three variables (Recency, Frequency and Monetary) is split into four equal groups, which creates 64 (4x4x4) different customer segments.
Steps of RFM(Recency, Frequency, Monetary):
For a further understanding of RFM, the links below can be referred to.
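This notebook goes on to cluster the raw RFM features rather than building the quartile scores explicitly, but the 4x4x4 segmentation described above can be sketched with `pd.qcut` on a synthetic RFM table (the column names and score conventions here are illustrative, not taken from the dataset):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
# Synthetic RFM table: one row per customer (values are illustrative)
rfm = pd.DataFrame({
    "Recency": rng.integers(1, 365, 1000),   # days since last transaction
    "Frequency": rng.integers(1, 20, 1000),  # number of transactions
    "Monetary": rng.gamma(2, 500, 1000),     # average spend
})

# Split each metric into four equal-sized groups (quartiles).
# A low Recency is good (the customer is recent), so its labels are reversed.
rfm["R"] = pd.qcut(rfm["Recency"], 4, labels=[4, 3, 2, 1]).astype(int)
# rank(method="first") breaks ties so qcut gets unique bin edges
rfm["F"] = pd.qcut(rfm["Frequency"].rank(method="first"), 4, labels=[1, 2, 3, 4]).astype(int)
rfm["M"] = pd.qcut(rfm["Monetary"], 4, labels=[1, 2, 3, 4]).astype(int)

# Concatenating the three scores yields one of 4*4*4 = 64 possible segments
rfm["RFM_segment"] = rfm["R"].astype(str) + rfm["F"].astype(str) + rfm["M"].astype(str)
print(rfm["RFM_segment"].nunique())  # at most 64 distinct segments
```

Segments such as '444' (recent, frequent, high-spending) would then be natural targets for retention campaigns.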
df['TransactionDate1']=df['TransactionDate'] # ==> to calculate the minimum (first transaction)
df['TransactionDate2']=df['TransactionDate'] # ==> to calculate the maximum (last transaction)
Now, to create the RFM table, the columns of the dataframe need to be aggregated so that we can calculate the Recency, Frequency and Monetary values. The aggregation strategy on the df dataframe is defined in the cell below.
#Creating RMF Table Strategy
RMF_df = df.groupby("CustomerID").agg({
"TransactionID" : "count",
"CustGender" : "first",
"CustLocation":"first",
"CustAccountBalance" : "mean",
"TransactionAmount (INR)" : "mean",
"Customer_age" : "median",
"TransactionDate2":"max",
"TransactionDate1":"min",
"TransactionDate":"median"
})
RMF_df = RMF_df.reset_index()
RMF_df.head()
| CustomerID | TransactionID | CustGender | CustLocation | CustAccountBalance | TransactionAmount (INR) | Customer_age | TransactionDate2 | TransactionDate1 | TransactionDate | |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | C1010024 | 1 | M | KOLKATA | 87058.65 | 5000.0 | 58.0 | 2016-08-18 | 2016-08-18 | 2016-08-18 00:00:00 |
| 1 | C1010243 | 2 | M | TEHRI GARHWAL | 98513.39 | 464.5 | 41.0 | 2016-03-08 | 2016-02-08 | 2016-02-22 12:00:00 |
| 2 | C1010279 | 1 | F | ANDHERI WEST MUMBAI | 137.91 | 368.0 | 35.0 | 2016-08-13 | 2016-08-13 | 2016-08-13 00:00:00 |
| 3 | C1010517 | 1 | M | SRIPERUMBUDUR | 14750.09 | 150.0 | 28.0 | 2016-05-08 | 2016-05-08 | 2016-05-08 00:00:00 |
| 4 | C1010655 | 1 | M | MEERUT | 45856.24 | 712.0 | 32.0 | 2016-08-24 | 2016-08-24 | 2016-08-24 00:00:00 |
Shape of the newly created RMF dataframe
RMF_df.shape
(18747, 10)
We drop the CustomerID column from the RMF dataframe as it is not needed for the further analysis.
RMF_df.drop(columns=["CustomerID"],inplace=True)
The aggregation applied to the TransactionID column in the RFM table is simply the count of transactions for each customer, which is exactly the Frequency. This is why the column is renamed to Frequency in the next cell.
The average TransactionAmount is the Monetary value for each customer, so we rename that column as well.
#Defining the Frequency
RMF_df.rename(columns={"TransactionID":"Frequency","TransactionAmount (INR)":"Monetary"},inplace=True)
Recency is the time elapsed between a customer's transactions. We calculate it as the number of days between the latest and the earliest transaction, using the two columns TransactionDate2 (latest transaction) and TransactionDate1 (first transaction).
#Defining the Recency
# Subtracting the two datetime columns gives a timedelta; .dt.days extracts the
# number of days directly (no string parsing needed)
RMF_df['Recency']=(RMF_df['TransactionDate2']-RMF_df['TransactionDate1']).dt.days
# A Recency of 0 days means the customer transacted only once (recently, by this
# logic), so we convert 0 to 1
RMF_df['Recency']=RMF_df['Recency'].replace(0,1)
RMF_df.head()
| Frequency | CustGender | CustLocation | CustAccountBalance | Monetary | Customer_age | TransactionDate2 | TransactionDate1 | TransactionDate | Recency | |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | M | KOLKATA | 87058.65 | 5000.0 | 58.0 | 2016-08-18 | 2016-08-18 | 2016-08-18 00:00:00 | 1 |
| 1 | 2 | M | TEHRI GARHWAL | 98513.39 | 464.5 | 41.0 | 2016-03-08 | 2016-02-08 | 2016-02-22 12:00:00 | 29 |
| 2 | 1 | F | ANDHERI WEST MUMBAI | 137.91 | 368.0 | 35.0 | 2016-08-13 | 2016-08-13 | 2016-08-13 00:00:00 | 1 |
| 3 | 1 | M | SRIPERUMBUDUR | 14750.09 | 150.0 | 28.0 | 2016-05-08 | 2016-05-08 | 2016-05-08 00:00:00 | 1 |
| 4 | 1 | M | MEERUT | 45856.24 | 712.0 | 32.0 | 2016-08-24 | 2016-08-24 | 2016-08-24 00:00:00 | 1 |
The columns TransactionDate1 and TransactionDate2 will be dropped, as they were only needed to calculate the Recency, which we have already done.
RMF_df.drop(columns=["TransactionDate1","TransactionDate2"],inplace=True)
Now that we have the RFM table created, we will look for any potential outliers in the data.
# calculate the outliers for each feature
lower_list=[]
upper_list=[]
num_list=[]
perc_list=[]
cols=['Frequency', 'CustAccountBalance','Monetary', 'Customer_age', 'Recency']
for i in cols:
Q1 = RMF_df[i].quantile(0.25)
Q3 = RMF_df[i].quantile(0.75)
IQR = Q3 - Q1
lower = Q1 - 1.5 * IQR
upper = Q3 + 1.5 * IQR
# calculate number of outliers
num=RMF_df[(RMF_df[i] < lower) | (RMF_df[i] > upper)].shape[0]
# calculate percentage of outliers
perc = (num / RMF_df.shape[0]) * 100
lower_list.append(lower)
upper_list.append(upper)
num_list.append(num)
perc_list.append(round(perc,2))
dic={'lower': lower_list, 'upper': upper_list, 'outliers': num_list, 'Perc%':perc_list }
outliers_df=pd.DataFrame(dic,index=['Frequency', 'CustAccountBalance','Monetary', 'Customer_age', 'Recency'])
outliers_df
| lower | upper | outliers | Perc% | |
|---|---|---|---|---|
| Frequency | 1.0000 | 1.0000 | 62 | 0.33 |
| CustAccountBalance | -68211.9425 | 125958.9175 | 2491 | 13.29 |
| Monetary | -1276.5000 | 2527.5000 | 2057 | 10.97 |
| Customer_age | 18.5000 | 54.5000 | 1008 | 5.38 |
| Recency | 1.0000 | 1.0000 | 57 | 0.30 |
As stated previously, the outliers can't be removed, as they themselves can be used to study the behavioural patterns of the clusters such values fall into.
RMF_df['TransactionDay'] = RMF_df['TransactionDate'].dt.dayofweek
RMF_df['TransactionMonth'] = RMF_df['TransactionDate'].dt.month
RMF_df.head()
| Frequency | CustGender | CustLocation | CustAccountBalance | Monetary | Customer_age | TransactionDate | Recency | TransactionDay | TransactionMonth | |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | M | KOLKATA | 87058.65 | 5000.0 | 58.0 | 2016-08-18 00:00:00 | 1 | 3 | 8 |
| 1 | 2 | M | TEHRI GARHWAL | 98513.39 | 464.5 | 41.0 | 2016-02-22 12:00:00 | 29 | 0 | 2 |
| 2 | 1 | F | ANDHERI WEST MUMBAI | 137.91 | 368.0 | 35.0 | 2016-08-13 00:00:00 | 1 | 5 | 8 |
| 3 | 1 | M | SRIPERUMBUDUR | 14750.09 | 150.0 | 28.0 | 2016-05-08 00:00:00 | 1 | 6 | 5 |
| 4 | 1 | M | MEERUT | 45856.24 | 712.0 | 32.0 | 2016-08-24 00:00:00 | 1 | 2 | 8 |
### Final RMF Dataframe
RMF_df.head()
| Frequency | CustGender | CustLocation | CustAccountBalance | Monetary | Customer_age | TransactionDate | Recency | TransactionDay | TransactionMonth | |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | M | KOLKATA | 87058.65 | 5000.0 | 58.0 | 2016-08-18 00:00:00 | 1 | 3 | 8 |
| 1 | 2 | M | TEHRI GARHWAL | 98513.39 | 464.5 | 41.0 | 2016-02-22 12:00:00 | 29 | 0 | 2 |
| 2 | 1 | F | ANDHERI WEST MUMBAI | 137.91 | 368.0 | 35.0 | 2016-08-13 00:00:00 | 1 | 5 | 8 |
| 3 | 1 | M | SRIPERUMBUDUR | 14750.09 | 150.0 | 28.0 | 2016-05-08 00:00:00 | 1 | 6 | 5 |
| 4 | 1 | M | MEERUT | 45856.24 | 712.0 | 32.0 | 2016-08-24 00:00:00 | 1 | 2 | 8 |
# correlation between features
plt.figure(figsize=(7,5))
correlation=RMF_df.corr(numeric_only=True)  # restrict to the numeric columns
sns.heatmap(correlation,vmin=None,
vmax=0.8,
cmap='rocket_r',
annot=True,
fmt='.1f',
linecolor='white',
cbar=True);
The above map shows that we don't have to worry about the collinearity problem, as none of the columns have a significant correlation.
plt.style.use("fivethirtyeight")
chart=sns.countplot(x='Frequency',data=RMF_df,palette='rocket', order = RMF_df['Frequency'].value_counts().index)
plt.title("Frequency",
fontsize='20',
backgroundcolor='AliceBlue',
color='magenta');
This shows that the transaction frequency is 1 for almost every customer, i.e. in this sample most customers have a single recorded transaction.
plt.hist(RMF_df['TransactionMonth'], color='yellow', edgecolor='black',bins = 10, linewidth=2)
plt.title('Transactions in each month')
plt.show()
The graph shows that the number of transactions spiked in the month of August.
plt.hist(RMF_df['TransactionDay'], color='yellow', edgecolor='black',bins = 10, linewidth=2)
plt.title('Transactions in each day of the week')
plt.show()
The transactions are distributed fairly uniformly across the days of the week.
plt.style.use("fivethirtyeight")
fig,ax=plt.subplots(ncols=2,nrows=1,figsize=(15,5))
palette_color = sns.color_palette('rocket')
ax[0].hist(x=RMF_df['Customer_age'],color='purple')
ax[0].set_title("Distribution of Customer Age")
ax[1].pie(RMF_df['CustGender'].value_counts(),autopct='%1.f%%',colors=palette_color,labels=['Male','Female'])
ax[1].set_title("Customer Gender")
plt.tight_layout();
The distribution of age is positively skewed, with the 30-40 age group having the highest frequency.
The customer records are dominated by males at 73%, with only 27% female records.
plt.style.use("fivethirtyeight")
plt.figure(figsize=(15,7))
chart=sns.countplot(y='CustLocation',data=RMF_df,palette='rocket', order = RMF_df['CustLocation'].value_counts()[:20].index)
plt.title("Most 20 Location of Customer ",
fontsize='20',
backgroundcolor='AliceBlue',
color='magenta');
This shows that the recorded transactions mostly belong to customers from Mumbai, Bangalore, New Delhi, Gurgaon, Delhi, Noida, Chennai and Pune, all major metropolitan areas.
plt.style.use("fivethirtyeight")
sns.scatterplot(x='Monetary',y='CustAccountBalance',data=RMF_df,palette='rocket',hue='Frequency',size='Recency' )
plt.title("Monetary and CustAccountBalance",
fontsize='20',
backgroundcolor='AliceBlue',
color='magenta');
This is one of the most important plots: it shows the spread of the data with respect to Monetary and CustAccountBalance. Frequency is shown using colour and Recency using the size of the data points.
As Frequency is 1 for the majority of customers, most points share the same colour, and Recency is likewise mostly 1.
The majority behaviour corresponds to customers with an account balance between 0 and 1,000,000 and a monetary value of roughly 0-8,000.
# Sorting and grouping the data month-wise to study the average behaviour over time
RMF_df=RMF_df.sort_values(by='TransactionDate')
# numeric_only=True skips the string columns (CustGender, CustLocation)
groupbby_month=RMF_df.groupby([pd.Grouper(key='TransactionDate',freq='M')]).mean(numeric_only=True)
print(groupbby_month.shape)
groupbby_month
(12, 7)
| Frequency | CustAccountBalance | Monetary | Customer_age | Recency | TransactionDay | TransactionMonth | |
|---|---|---|---|---|---|---|---|
| TransactionDate | |||||||
| 2016-01-31 | 1.000000 | 69906.068790 | 1400.891081 | 37.571429 | 1.000000 | 4.477477 | 1.0 |
| 2016-02-29 | 1.002656 | 82857.049880 | 1751.908858 | 37.745020 | 1.158035 | 0.543161 | 2.0 |
| 2016-03-31 | 1.000000 | 90811.376536 | 1339.061988 | 37.467857 | 1.000000 | 1.588095 | 3.0 |
| 2016-04-30 | 1.000000 | 99780.835910 | 1229.883830 | 37.484634 | 1.000000 | 4.578014 | 4.0 |
| 2016-05-31 | 1.002660 | 77862.819568 | 1421.804428 | 37.194814 | 1.523936 | 2.977394 | 5.0 |
| 2016-06-30 | 1.015421 | 79584.377165 | 1854.192491 | 37.495848 | 3.112693 | 2.435350 | 6.0 |
| 2016-07-31 | 1.009259 | 90065.473131 | 1601.971701 | 37.425347 | 2.035880 | 4.417824 | 7.0 |
| 2016-08-31 | 1.002204 | 99542.131787 | 1479.473098 | 37.877723 | 1.060296 | 2.857365 | 8.0 |
| 2016-09-30 | 1.004923 | 85972.026034 | 1310.581846 | 37.740155 | 1.182841 | 2.909634 | 9.0 |
| 2016-10-31 | 1.005501 | 94627.117992 | 1747.632068 | 38.239274 | 1.430143 | 5.418042 | 10.0 |
| 2016-11-30 | 1.001164 | 78637.287247 | 1206.780780 | 37.460419 | 1.000000 | 1.551804 | 11.0 |
| 2016-12-31 | 1.001337 | 101923.685107 | 1519.361190 | 38.229278 | 1.000000 | 3.466578 | 12.0 |
Average Frequency remains about the same throughout, but average Monetary was highest in the month of June, and average Recency was also highest in June.
plt.style.use("fivethirtyeight")
fig,ax=plt.subplots(ncols=2,nrows=1,figsize=(15,5))
ax[0].plot(groupbby_month.index,groupbby_month['CustAccountBalance'],color='purple',marker='o',label='Customer Account Balance')
ax[0].set_title('Cust Account Balance Over The Time')
ax[1].plot(groupbby_month.index,groupbby_month['Monetary'],color='purple',marker='o',label='Monetary')
ax[1].set_title("Monetary Over The Time")
plt.legend();
The plots show the behaviour of the average customer account balance and the average monetary value over the months.
As we would expect, months with a lower average transaction amount (Monetary) tend to show a higher average account balance, and the graphs broadly reflect this.
RMF_df=RMF_df.reset_index(drop=True)
RMF_df.head()
| Frequency | CustGender | CustLocation | CustAccountBalance | Monetary | Customer_age | TransactionDate | Recency | TransactionDay | TransactionMonth | |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | M | BHOPAL | 33206.65 | 300.00 | 33.0 | 2016-01-08 | 1 | 4 | 1 |
| 1 | 1 | M | BANGALORE | 12819.30 | 20.00 | 38.0 | 2016-01-08 | 1 | 4 | 1 |
| 2 | 1 | M | THANE | 89365.55 | 1484.52 | 46.0 | 2016-01-08 | 1 | 4 | 1 |
| 3 | 1 | M | KOTA | 13956.38 | 1150.00 | 42.0 | 2016-01-08 | 1 | 4 | 1 |
| 4 | 1 | F | SALEM | 22091.49 | 500.00 | 39.0 | 2016-01-08 | 1 | 4 | 1 |
lab = preprocessing.LabelEncoder()
RMF_df['CustGender']= lab.fit_transform(RMF_df['CustGender'])
RMF_df['CustLocation']= lab.fit_transform(RMF_df['CustLocation'])
RMF_df.head()
| Frequency | CustGender | CustLocation | CustAccountBalance | Monetary | Customer_age | TransactionDate | Recency | TransactionDay | TransactionMonth | |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 1 | 225 | 33206.65 | 300.00 | 33.0 | 2016-01-08 | 1 | 4 | 1 |
| 1 | 1 | 1 | 150 | 12819.30 | 20.00 | 38.0 | 2016-01-08 | 1 | 4 | 1 |
| 2 | 1 | 1 | 1538 | 89365.55 | 1484.52 | 46.0 | 2016-01-08 | 1 | 4 | 1 |
| 3 | 1 | 1 | 808 | 13956.38 | 1150.00 | 42.0 | 2016-01-08 | 1 | 4 | 1 |
| 4 | 1 | 0 | 1337 | 22091.49 | 500.00 | 39.0 | 2016-01-08 | 1 | 4 | 1 |
RMF_df.drop(['TransactionDate','TransactionDay','TransactionMonth'],axis=1,inplace=True)
RMF_df.head()
| Frequency | CustGender | CustLocation | CustAccountBalance | Monetary | Customer_age | Recency | |
|---|---|---|---|---|---|---|---|
| 0 | 1 | 1 | 225 | 33206.65 | 300.00 | 33.0 | 1 |
| 1 | 1 | 1 | 150 | 12819.30 | 20.00 | 38.0 | 1 |
| 2 | 1 | 1 | 1538 | 89365.55 | 1484.52 | 46.0 | 1 |
| 3 | 1 | 1 | 808 | 13956.38 | 1150.00 | 42.0 | 1 |
| 4 | 1 | 0 | 1337 | 22091.49 | 500.00 | 39.0 | 1 |
# data scaling
df_scaled=StandardScaler().fit_transform(RMF_df)
df_scaled=pd.DataFrame(df_scaled,columns=RMF_df.columns)
df_scaled.head()
| Frequency | CustGender | CustLocation | CustAccountBalance | Monetary | Customer_age | Recency | |
|---|---|---|---|---|---|---|---|
| 0 | -0.057165 | 0.608518 | -1.216462 | -0.159456 | -0.221783 | -0.543911 | -0.042985 |
| 1 | -0.057165 | 0.608518 | -1.389119 | -0.214844 | -0.275029 | 0.028657 | -0.042985 |
| 2 | -0.057165 | 0.608518 | 1.806193 | -0.006884 | 0.003472 | 0.944767 | -0.042985 |
| 3 | -0.057165 | 0.608518 | 0.125661 | -0.211755 | -0.060142 | 0.486712 | -0.042985 |
| 4 | -0.057165 | -1.643336 | 1.343471 | -0.189653 | -0.183750 | 0.143171 | -0.042985 |
The intention behind applying PCA is to obtain the components along which most of the variation in the data is explained, which will help with the further clustering.
pca = PCA()
pca_data = pca.fit_transform(df_scaled.iloc[:,:-1])
pca.explained_variance_ratio_
array([0.20695732, 0.17047326, 0.1665273 , 0.16488166, 0.15265303,
0.13850744])
plt.plot(list(range(1,7)),pca.explained_variance_ratio_)
plt.axis([0,7,0,max(pca.explained_variance_ratio_)+0.05])
plt.xticks(list(range(1,7)))
plt.xlabel('Principal Components')
plt.ylabel('Variance Explained')
plt.show()
plt.plot(list(range(1,7)),np.cumsum(pca.explained_variance_ratio_))
plt.axis([0,7,0,1.1])
plt.axhline(y=0.8,color='r',linestyle='--',linewidth=1)
plt.xticks(list(range(1,7)))
plt.xlabel('# of Principal Components')
plt.ylabel('Cumulative Variance Explained')
plt.show()
From the above analysis we conclude that 5 principal components are enough to explain about 86% of the variation in the data, which is why we will fit and evaluate 5 principal components for further analysis.
pca = PCA(n_components=5)
pca_data = pca.fit_transform(df_scaled.iloc[:,:-1])
K-means clustering is a method of vector quantization, originally from signal processing, that partitions observations into k clusters in which each observation belongs to the cluster with the nearest mean (centroid).
The elbow method is a graphical way of finding the optimal 'K' in K-means clustering.
The silhouette coefficient, or silhouette score, is a metric used to evaluate the goodness of a clustering technique. Its value ranges from -1 to 1; a value near 1 means the clusters are well apart from each other and clearly distinguished.
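As a quick illustration of the metric on toy data (separate from the bank dataset), two tight, well-separated blobs score close to 1:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
# Two well-separated 2-D blobs
X = np.vstack([rng.normal(0, 0.3, (100, 2)),
               rng.normal(5, 0.3, (100, 2))])

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
score = silhouette_score(X, labels)
print(round(score, 2))  # well above 0.5 for this separation
```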
The Dendrogram is a diagram that shows the hierarchical relationship between objects. It is most commonly created as an output from hierarchical clustering
model = KMeans(init='k-means++', random_state=42)
visualizer = KElbowVisualizer(model, k=(2, 30), timings=False)
visualizer.fit(pca_data)
visualizer.show()
<AxesSubplot: title={'center': 'Distortion Score Elbow for KMeans Clustering'}, xlabel='k', ylabel='distortion score'>
As per the elbow plot, choosing 8 clusters would be a fair guess for segmenting the customers.
plt.figure(figsize=(30,10))
z = sch.linkage(pca_data[:350,:], 'ward')
sch.dendrogram(z);
Even the dendrogram shows that 8 clusters is a reasonable choice: cutting the tree at a height of 5 yields about 8 clusters for the given data.
kmeans_set = {"init":"random", "max_iter":300, "random_state":42}
PCA_kmeans = KMeans(n_clusters=8 , **kmeans_set)
PCA_kmeans.fit(pca_data)
df_scaled['label'] = PCA_kmeans.labels_
ss_k=silhouette_score(pca_data,PCA_kmeans.labels_)
df_scaled.head()
| Frequency | CustGender | CustLocation | CustAccountBalance | Monetary | Customer_age | Recency | label | |
|---|---|---|---|---|---|---|---|---|
| 0 | -0.057165 | 0.608518 | -1.216462 | -0.159456 | -0.221783 | -0.543911 | -0.042985 | 1 |
| 1 | -0.057165 | 0.608518 | -1.389119 | -0.214844 | -0.275029 | 0.028657 | -0.042985 | 1 |
| 2 | -0.057165 | 0.608518 | 1.806193 | -0.006884 | 0.003472 | 0.944767 | -0.042985 | 4 |
| 3 | -0.057165 | 0.608518 | 0.125661 | -0.211755 | -0.060142 | 0.486712 | -0.042985 | 4 |
| 4 | -0.057165 | -1.643336 | 1.343471 | -0.189653 | -0.183750 | 0.143171 | -0.042985 | 2 |
df_scaled.label.value_counts()
1    6310
4    5999
6    2521
2    2306
5    1424
3      75
0      62
7      50
Name: label, dtype: int64
As clusters 0 and 3 have very few representatives, we won't include them in the radar plots below.
attributes =df_scaled.columns[0:7]
color={2:'purple',4:'yellow',1:'red',5:'blue',6:'orange',7:'green'}
for index in color:
fig = go.Figure()
fig.add_trace(go.Scatterpolar(r=np.mean(df_scaled.loc[df_scaled.label == index, attributes]),
theta=attributes,
fill='toself',
fillcolor=color[index],
opacity=0.4,
name='cluster'+str(index)))
fig.update_layout(polar=dict(radialaxis=dict(visible=True)),showlegend=True,
title="Radar plot - K-Means Cluster"+str(index))
fig.show()
From the above plots we can interpret the following:
Customers in cluster 6 have higher account balances and monetary values, along with higher age and recency, so the bank can target this cluster for loans and other benefits.
Cluster 2, on the other hand, has younger customers with low monetary values and account balances, and a moderate recency.
Cluster 5 has the oldest customers in the dataset with a very low recency value, that is, they don't transact frequently.
Gaussian mixture models are a probabilistic model for representing normally distributed subpopulations within an overall population. Mixture models in general don't require knowing which subpopulation a data point belongs to, allowing the model to learn the subpopulations automatically. Since subpopulation assignment is not known, this constitutes a form of unsupervised learning.
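Unlike k-means, a fitted mixture model exposes each point's soft cluster membership through `predict_proba`. A minimal sketch on synthetic 1-D data (the parameters here are illustrative):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(1)
# Two overlapping 1-D Gaussian subpopulations
X = np.concatenate([rng.normal(-2, 1, 300),
                    rng.normal(2, 1, 300)]).reshape(-1, 1)

gmm = GaussianMixture(n_components=2, random_state=1).fit(X)
proba = gmm.predict_proba(X)  # soft memberships: each row sums to 1
hard = gmm.predict(X)         # hard labels are the argmax of the memberships
print(proba[0].round(3))
```

Points near the overlap get memberships close to 0.5/0.5, which is exactly the uncertainty that a hard assignment like k-means cannot express.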
from sklearn.mixture import GaussianMixture
gmm = GaussianMixture(n_components=2).fit(pca_data)
labels = gmm.predict(pca_data)
df_scaled['label_gm'] = labels
ss_g=silhouette_score(pca_data,labels)
df_scaled.head()
| Frequency | CustGender | CustLocation | CustAccountBalance | Monetary | Customer_age | Recency | label | label_gm | |
|---|---|---|---|---|---|---|---|---|---|
| 0 | -0.057165 | 0.608518 | -1.216462 | -0.159456 | -0.221783 | -0.543911 | -0.042985 | 1 | 1 |
| 1 | -0.057165 | 0.608518 | -1.389119 | -0.214844 | -0.275029 | 0.028657 | -0.042985 | 1 | 1 |
| 2 | -0.057165 | 0.608518 | 1.806193 | -0.006884 | 0.003472 | 0.944767 | -0.042985 | 4 | 1 |
| 3 | -0.057165 | 0.608518 | 0.125661 | -0.211755 | -0.060142 | 0.486712 | -0.042985 | 4 | 1 |
| 4 | -0.057165 | -1.643336 | 1.343471 | -0.189653 | -0.183750 | 0.143171 | -0.042985 | 2 | 1 |
df_scaled.label_gm.value_counts()
1    16472
0     2275
Name: label_gm, dtype: int64
attributes =df_scaled.columns[0:7]
color={0:'purple',1:'yellow'}
for index in color:
fig = go.Figure()
fig.add_trace(go.Scatterpolar(r=np.mean(df_scaled.loc[df_scaled.label_gm == index, attributes]),
theta=attributes,
fill='toself',
fillcolor=color[index],
opacity=0.4,
name='cluster'+str(index)))
fig.update_layout(polar=dict(radialaxis=dict(visible=True)),showlegend=True,
title="Radar plot -Gaussian Mixture Model Cluster"+str(index))
fig.show()
Two components gave the Gaussian Mixture Model its best silhouette score.
After k-means, the Gaussian Mixture Model has given two more clusters that can be used to study the behaviours of the customers.
Density-based spatial clustering of applications with noise (DBSCAN) is a data clustering algorithm proposed by Martin Ester, Hans-Peter Kriegel, Jörg Sander and Xiaowei Xu in 1996. It is a density-based clustering non-parametric algorithm: given a set of points in some space, it groups together points that are closely packed together (points with many nearby neighbors), marking as outliers points that lie alone in low-density regions (whose nearest neighbors are too far away).
Why do we need DBSCAN?
K-Means and Hierarchical Clustering both fail in creating clusters of arbitrary shapes. They are not able to form clusters based on varying densities. That’s why we need DBSCAN clustering.
The parameters involved are epsilon (the neighbourhood radius around each point) and min_samples (the minimum number of points required within that radius to form a dense region).
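The arbitrary-shape failure mode is easy to reproduce on scikit-learn's two-moons toy data, where k-means cuts straight through the crescents while DBSCAN recovers them and marks sparse points as noise (label -1). The eps and min_samples values below are illustrative for this toy set only:

```python
from sklearn.datasets import make_moons
from sklearn.cluster import KMeans, DBSCAN

X, _ = make_moons(n_samples=400, noise=0.05, random_state=0)

# K-means is forced into two roughly spherical halves
km_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# DBSCAN groups by density instead
db = DBSCAN(eps=0.2, min_samples=5).fit(X)

# Number of DBSCAN clusters, excluding the noise label -1
n_clusters = len(set(db.labels_)) - (1 if -1 in db.labels_ else 0)
print(n_clusters)  # DBSCAN finds the two crescents
```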
from sklearn.neighbors import NearestNeighbors
from matplotlib import pyplot as plt
neighbors = NearestNeighbors(n_neighbors=100)
neighbors_fit = neighbors.fit(pca_data)
distances, indices = neighbors_fit.kneighbors(pca_data)
distances = np.sort(distances, axis=0)
# distance to each point's nearest neighbour; for min_samples=100 the
# 100th-neighbour distance (distances[:, -1]) is another common choice
distances = distances[:,1]
plt.plot(distances)
[<matplotlib.lines.Line2D at 0x13eb8a61ea0>]
From the graph we can observe that the distance increases drastically at around 0.8. Beyond this distance the density of points drops, so we can use an epsilon of 0.8 to form reasonable clusters.
dbscan_cluster_x = DBSCAN(eps=0.8, min_samples=100)
dbscan_cluster_x.fit(pca_data)
DBSCAN(eps=0.8, min_samples=100)
labels=dbscan_cluster_x.labels_
df_scaled['label_db'] = labels
ss_d=silhouette_score(pca_data,labels)
df_scaled.head()
| Frequency | CustGender | CustLocation | CustAccountBalance | Monetary | Customer_age | Recency | label | label_gm | label_db | |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -0.057165 | 0.608518 | -1.216462 | -0.159456 | -0.221783 | -0.543911 | -0.042985 | 1 | 1 | 0 |
| 1 | -0.057165 | 0.608518 | -1.389119 | -0.214844 | -0.275029 | 0.028657 | -0.042985 | 1 | 1 | 0 |
| 2 | -0.057165 | 0.608518 | 1.806193 | -0.006884 | 0.003472 | 0.944767 | -0.042985 | 4 | 1 | 0 |
| 3 | -0.057165 | 0.608518 | 0.125661 | -0.211755 | -0.060142 | 0.486712 | -0.042985 | 4 | 1 | 0 |
| 4 | -0.057165 | -1.643336 | 1.343471 | -0.189653 | -0.183750 | 0.143171 | -0.042985 | 2 | 1 | 1 |
df_scaled.label_db.value_counts()
 0    13174
 1     4800
-1      773
Name: label_db, dtype: int64
attributes =df_scaled.columns[0:7]
color={0:'purple',1:'yellow',-1:'blue'}
for index in color:
fig = go.Figure()
fig.add_trace(go.Scatterpolar(r=np.mean(df_scaled.loc[df_scaled.label_db == index, attributes]),
theta=attributes,
fill='toself',
fillcolor=color[index],
opacity=0.4,
name='cluster'+str(index)))
fig.update_layout(polar=dict(radialaxis=dict(visible=True)),showlegend=True,
title="Radar plot -DBSCAN Cluster"+str(index))
fig.show()
The above plots show three clusters.
In conclusion, K-means gave 8 clusters, of which 2 had well-defined behaviours, and DBSCAN has produced 3 clusters, two of which correspond to the two defined by K-means.
# Comparing the silhouette scores
compare=pd.DataFrame({'Algorithm':['K-means','Gaussian Mixture','DBSCAN'],'Silhouette Score':[ss_k,ss_g,ss_d]})
compare
| Algorithm | Silhouette Score | |
|---|---|---|
| 0 | K-means | 0.429659 |
| 1 | Gaussian Mixture | 0.404078 |
| 2 | DBSCAN | 0.416331 |
The performance of all three clustering algorithms was fairly similar on the given dataset. We summarise the conclusion for each algorithm here.
We conclude that different algorithms give different clusters of varying behaviour; it is then up to the stakeholders to decide which clusters they will focus on for further analysis.